Soumya D. Sanyal

Jan 10, 2016

Looking at the dataset - correlations

In this post, I'm going to try and see if I can elicit any relationships between the explanatory features and the explained features (expenditure).

This part of the project serves to aid in getting to actually know the data, and in hypothesis making, hypothesis refinement, and model selection; it essentially lays the foundation for the "story" that the project will tell.

I find this stage of the modeling process to be one of the most interesting. It's practically impossible to know if your priors about the problem you're looking at will influence the modeling process - I think they almost certainly will. At the same time, there's dramatic tension - will my hunch about the problem be supported by the data or will the data show me that it's completely false? Will another insight rise to take it's place? Will there be any story at all? All of these outcomes are entirely possible at this stage, and it's quite nerve wracking.

In any case, I can do my best by examining the data for as much supporting (or refuting) evidence in as objective a manner as possible.

Here are some of the questions I'm hoping to answer at the end of this post:

  1. What explanatory features are predictive (if any)?
  2. What explains the statistical variation in the dataset? What does the randomness look like? Is there a pattern? Surely it's not all just white noise!
  3. Which features are mutually correlated, and which are mutually uncorrelated? Do any features combine to create stronger relationships?

Next, we've accumulated a lot of boilerplate code that I need to load to start up my modeling session at this point. Rather than subject you to a lot of scrolling, I'm simply going to save it to a python module called blogging.py, and import it in one step. If you'd like to take a look at the environment I'm importing to build the analysis in this post, it's here.

In [1]:
from blogging import *

The next step in my analysis is to explore the data a bit more. I want to see if I can get any more insight into what the underlying data generating process is; the more I can rely on a clean look at the data, the better I'll be able to dispel my own incorrect assumptions about what is or should be happening.

One problem I do have to contend with in working with this dataset is that it comes as weighted survey data. Working with weighted data can be quite tricky. In the very first instance, since the unweighted data does not (by definition) accurately represent the population of interest, it's impossible to draw valid inferences from the standard exploratory (or even regression) procedure without accounting for the weights.

When constructing summary data, as in a histogram, or constructing a linear function of the data, this is not much of a problem. Any linear function $f(x)$ of the data is just as easy to compute as $f(Wx)$, where $W$ is a weight matrix, and variances will just be quadratic in $W$ as expected. However, there are myriad other challenges in working with weighted data; there's a relatively recent discussion of some of these challenges in this paper by Andrew Gelman.

For my immediate purposes - my Insight project - the challenge was to come up with some rough and ready ways to accommodate these weights. While the histogram methods in the Python ecosystem have options for specifying item level weights, not all EDA function calls do. In particular, scatterplots don't (as far as I know). Most model estimation calls that I looked into before starting this project did have some accommodations for item weights, so I figured that if I could get past the EDA stage, I might be fine.

I needed to find a way of getting a sense of the pairwise correlations in the dataset while accounting for the weights. In the end, I found that matplotlib hexbin plots had some support for item weights, and I figured that adding the weighted sum of points to hexbin cells would do well enough without my having to write my own visualization functions. In the end, this is what I constructed:

In [2]:
def look_at_pairs(x, y, x_range, y_range, w, colorbar=False, gridsize=200, bins=100, mincount=None):
    x_min, x_max= x_range[0], x_range[1]
    y_min, y_max = y_range[0], y_range[1]
    plt.hexbin(x, y, C = w, gridsize = gridsize, bins = bins,
       xscale = 'linear', yscale = 'linear',
       cmap=None, norm=None, vmin=None, vmax=None,
       alpha=1, linewidths=None, edgecolors='none',
       reduce_C_function = np.sum, mincnt=mincount, marginals=False,
          )
    plt.ylim([y_min,y_max])
    plt.xlim([x_min,x_max])
    if colorbar:
        plt.colorbar()
    plt.show()

Let's try this out on a an example pair of features.

In [3]:
look_at_pairs(x=data["AGE_AS_OF_12/31/13_(EDITED/IMPUTED)"], 
              y=data["TOTAL_OFFICE-BASED_EXP_13"], 
              x_range=[0,85], 
              y_range=[0,200000],
              w=data["FINAL_PERSON_WEIGHT_2013"],
              colorbar=True,
              gridsize=500,
              bins=85,
             )

In this graph, greater (weighted) mass in any one hexbin results in a higher color count (according to the colorbar scale on the right). More tightly packed data results in a greater plot density.

There looks to be a general tendency for older people to spend more on office based services in this picture, but my sense of the trend is obscured by the outliers above 100k. Of these outliers, most are in the 70-80 year range, with one below 20 years. There's not enough data for me to make a robust case about whether age correlates well with the incidence of really high expenditure, so I'm going to try looking at this again after cutting off the outliers.

In [4]:
look_at_pairs(x=data["AGE_AS_OF_12/31/13_(EDITED/IMPUTED)"], 
              y=data["TOTAL_OFFICE-BASED_EXP_13"], 
              x_range=[0,85], 
              y_range=[0,100000],
              w=data["FINAL_PERSON_WEIGHT_2013"],
              colorbar=True,
              gridsize=500,
              bins=85,
             )

This is helpful. It looks like a lot of weighted mass (the aqua region at the bottom) is concentrated near zero expenditure. However, it highlights the fact that expenditure trends higher with age (as does variance in expenditure).

It looks like this is a fairly useful visualization technique as far as quick and dirty goes.

Here are the other groups of correlations I want to inspect:

  1. Income
  2. Utilization
  3. Health conditions
  4. Duration without insurance
  5. Age of diagnosis
  6. BMI
  7. Demographic information
  8. Expenditures.

I also want to look at the four expenditure categories at the same time. Here's a quick method to help me do that:

In [5]:
def four_correlations(x,
                      x_range,
                      w,
                      xticks=None,
                      office_range=[0,100000],
                      outpatient_range=[0,26000],
                      inpatient_range=[0,320000],
                      er_range=[0,70000],
                      colorbar=False, 
                      gridsize=200, 
                      bins=100, 
                      mincount=None):
    
    term=x.name
    plt.figure(figsize=(15,15))
    
    for pos,y in [(1,data["TOTAL_OFFICE-BASED_EXP_13"]),
                  (2,data["TOTAL_OUTPATIENT_PROVIDER_EXP_13"]),
                  (3, data["TOT_HOSP_IP_FACILITY_+_DR_EXP_13"]),
                  (4, data["TOTAL_ER_FACILITY_+_DR_EXP_13"]),
                 ]:
        plt.subplot(2,2,pos)
        plt.hexbin(x,
                   y,
                   C = w,
                   gridsize = gridsize,
                   bins = bins,
                   xscale = 'linear',
                   yscale = 'linear',
                   cmap=None,
                   norm=None,
                   vmin=None,
                   vmax=None,
                   alpha=1,
                   linewidths=None,
                   edgecolors='none',
                   reduce_C_function = np.sum,
                   mincnt=mincount,
                   marginals=False,
                   )
        plt.title(y.name)

        if pos==1:
            plt.ylim(office_range)
        if pos==2:
            plt.ylim(outpatient_range)
        if pos==3:
            plt.ylim(inpatient_range)
        if pos==4:
            plt.ylim(er_range)
            
        plt.xlim(x_range)
        plt.xlabel(term)
        
        if xticks:
            plt.xticks(xticks[0],[thing.replace("NON-HISPANIC ","")[:15] for thing in xticks[1]], rotation=40)
        if colorbar:
            plt.colorbar()
        
        
    plt.tight_layout()
    plt.show()
    
    
In [6]:
four_correlations(data["AGE_AS_OF_12/31/13_(EDITED/IMPUTED)"], x_range=[0,85], w=data["FINAL_PERSON_WEIGHT_2013"],
                 colorbar=True,)

It looks like the same pattern holds for the other expenditure categories as for office expenditure when it comes to age.

Looking at income:

In [7]:
four_correlations(data["FAMILY\'S_TOTAL_INCOME"],
                  x_range=[0,500000],
                  w=data["FINAL_PERSON_WEIGHT_2013"],
                  office_range=[0,100000],
                  outpatient_range=[0,40000],
                  inpatient_range=[0,100000],
                  er_range=[0,40000],
                 colorbar=True)

Here you get the result that while the level of spending moves away from zero as income rises, the variance of expenditure in all four categories declines with income. An important question is whether the reduced variation in spending at higher incomes is simply a result of having fewer people at those income levels, or if there is a systematic effect (more effective health management, better bargaining power).

Looking at how utilization correlates with expenditure:

In [8]:
for term in utilizations:
    x_min=data[term].min()
    x_max=data[term].max()
    x_range=[x_min,x_max]
    four_correlations(data[term], x_range=x_range, w=data["FINAL_PERSON_WEIGHT_2013"],
                     colorbar=True)
    print("\n\n\n")












I get the following out of these plots:

  1. Office based utilization and expenditure are positively correlated. Office based utilization is negatively correlated with all other expenditure.
  2. Outpatient utilization doesn't strongly correlate with outpatient expenditure. Outpatient utilization correlates somewhat negatively with other categories of expenditure.
  3. ER visits correlate negatively with other expenditures, and not particularly at all with ER expenditure.
  4. Inpatient visits correspond positively with inpatient utilization, and negatively with all other expenditure.

Next, I'll look at chronic health diagnoses. There are over 40 in my list, so it'll be useful to break them into groups.

Before we take a look at these plots, I want to quickly mention a very counterintuitive finding that I came across when I looked at these. For many of these chronic conditions, the subpopulation that does not have a positive diagnosis has a much heavier right tail than the subpopulation with the chronic condition.

On the face of it, it looked like people overall were at greater risk of spending more if they did not have a chronic illness than if they did. The paradox disappears when you zoom in on these plots around the first $\$10,000$ spent; it shows that many more people without a chronic condition are clustered at $\$0$ spent, relatively speaking. It's interesting to conjecture what may cause the thinner tails for chronic sufferers - one explanation may just be that the larger subpopulation is naturally accompanied by a larger variance.

The plots below shows the scatterplots to this limited scale for emphasis.

In [9]:
cancer=[term for term in health if "CANCER" in term]
lung=[term for term in health if "LUNG" in term or "BRONCH" in term or "ASTHMA" in term
     or "EMPHYSEMA" in term
     or "INH" in term]
cardio=[term for term in health if "BLOOD" in term or "CHOLESTEROL" in term or "CORON" in term or "HEART" in term
       or "STROKE" in term
       or "DIABETES" in term
       or "ANGINA" in term]
joint=[term for term in health if "JOINT" in term or "ARTHR" in term]
other=[term for term in health if term not in joint+cardio+lung+cancer]
In [10]:
for term in cancer:
    x_min=data[term].min()
    x_max=data[term].max()
    x_range=[-2,3]

    four_correlations(data[term],
                      x_range=x_range,
                      w=data["FINAL_PERSON_WEIGHT_2013"],
                      xticks=[[-1,1,2],["inapplicable","yes","no"]],
                      office_range=[0,5000],
                      outpatient_range=[0,5000],
                      inpatient_range=[0,5000],
                      er_range=[0,10000],
                      colorbar=True,
                     )
    print("\n\n\n")







































Conclusions:

  1. A cancer diagnosis overall drives expenditure for office based services and inpatient services, and has less of a clear effect on outpatient spending and ER spending. For these latter two categories, there is greater variance in spending for the subpopulation without a positive cancer diagnosis.
  2. There is very little mass in the graphs measuring spending against the various subdiagnoses of cancer. It seems likely that there will be a corresponding lack of statistical significance of these features in any analysis I do, and it may be well to leave those out and just use the overall cancer diagnosis feature.
In [11]:
for term in lung:
    x_min=data[term].min()
    x_max=data[term].max()
    x_range=[-2,3]

    four_correlations(data[term],
                      x_range=x_range,
                      w=data["FINAL_PERSON_WEIGHT_2013"],
                      xticks=[[-1,1,2],["inapplicable","yes","no"]],
                      office_range=[0,10000],
                      outpatient_range=[0,10000],
                      inpatient_range=[0,10000],
                      er_range=[0,20000],
                      colorbar=True,
                     )
    print("\n\n\n")




































Conclusions:

  1. Diagnoses of emphysema, bronchitis, and asthma drive spending on office services and inpatient services. They have less of an effect on outpatient and ER services. The right tails in these latter categories is heavier in the negative group.
  2. The other features do not seem to be as predictive of increased spending.
In [12]:
for term in cardio:
    x_min=data[term].min()
    x_max=data[term].max()
    x_range=[-2,3]

    four_correlations(data[term],
                      x_range=x_range,
                      w=data["FINAL_PERSON_WEIGHT_2013"],
                      xticks=[[-1,1,2],["inapplicable","yes","no"]],
                      office_range=[0,10000],
                      outpatient_range=[0,10000],
                      inpatient_range=[0,10000],
                      er_range=[0,20000],
                      colorbar=True,
                     )
    print("\n\n\n")



























Conclusions:

  1. Diagnoses of high blood pressure, high cholesterol, coronary heart disease, angina, heart attack, other heart disease, stroke, and diabetes drive spending on office services and inpatient services, while having a limited effect on the other two categories of spending, in which they are associated with thinner right tails.
  2. A multiple diagnosis of high blood pressure is probably not very predictive.
In [13]:
for term in joint:
    x_min=data[term].min()
    x_max=data[term].max()
    x_range=[-2,3]

    four_correlations(data[term],
                      x_range=x_range,
                      w=data["FINAL_PERSON_WEIGHT_2013"],
                      xticks=[[-1,1,2],["inapplicable","yes","no"]],
                      office_range=[0,10000],
                      outpatient_range=[0,10000],
                      inpatient_range=[0,10000],
                      er_range=[0,20000],
                      colorbar=True,
                     )
    print("\n\n\n")









Conclusions:

  1. Diagnoses of joint pain and arthritis drive spending on office services and inpatient services, while having a limited effect on the other two categories of spending, in which they are associated with thinner right tails.
In [14]:
for term in other:
    x_min=data[term].min()
    x_max=data[term].max()
    x_range=[-2,3]

    four_correlations(data[term],
                      x_range=x_range,
                      w=data["FINAL_PERSON_WEIGHT_2013"],
                      xticks=[[-1,1,2],["inapplicable","yes","no"]],
                      office_range=[0,10000],
                      outpatient_range=[0,10000],
                      inpatient_range=[0,10000],
                      er_range=[0,20000],
                      colorbar=True,
                     )
    print("\n\n\n")















Conclusions:

  1. Diagnoses of ADD/ADHD, limitations in physical functioning, and incidence of pregnancy drive spending on office services and inpatient services, while having a limited effect on the other two categories of spending, in which they are associated with thinner right tails.
In [15]:
for term in durations:
    x_min=data[term].min()
    x_max=data[term].max()
    x_range=[0,52]

    four_correlations(data[term],
                      x_range=x_range,
                      w=data["FINAL_PERSON_WEIGHT_2013"],
                      #xticks=[[-1,1,2],["inapplicable","yes","no"]],
                      office_range=[0,40000],
                      outpatient_range=[0,20000],
                      inpatient_range=[0,800000],
                      er_range=[0,10000],
                      colorbar=True,
                     )
    print("\n\n\n")



Conclusion:

  1. It doesn't look like spending is correlated with time elapsed without health insurance.
In [16]:
for term in age_diagnosed:
    x_min=data[term].min()
    x_max=data[term].max()
    x_range=[0,85]

    four_correlations(data[term],
                      x_range=x_range,
                      w=data["FINAL_PERSON_WEIGHT_2013"],
                      #xticks=[[-1,1,2],["inapplicable","yes","no"]],
                      office_range=[0,60000],
                      outpatient_range=[0,30000],
                      inpatient_range=[0,100000],
                      er_range=[0,40000],
                      colorbar=True,
                     )
    print("\n\n\n")




































Conclusion:

  1. There doesn't seem to be a strong correlation between the age of diagnosis for any of these chronic conditions and annual spending on any category of healthcare.
In [17]:
for term in demog:
    x_min=data[term].min()
    x_max=data[term].max()
    x_range=[x_min-1,x_max+1]
    d={term:[list(interpretation[term].keys()),list(interpretation[term].values())] for term in interpretation}
    d["categorical"]=[[-1,1,2],["inapplicable","yes","no"]]
    four_correlations(data[term],
                      x_range=x_range,
                      w=data["FINAL_PERSON_WEIGHT_2013"],
                      xticks=d[term] if term in interpretation else d["categorical"],
                      office_range=[0,10000],
                      outpatient_range=[0,10000],
                      inpatient_range=[0,10000],
                      er_range=[0,20000],
                      colorbar=True,
                     )
    print("\n\n\n")









Conclusions:

  1. No features except insurance status generate significant differences in outpatient expenditure.
  2. ER spending is affected by race, marital status, employment status, self-employment status, holding multiple jobs, union status, whether you are offered health insurance by your employer, your overall insurance status (notably whether you have private insurance or a public option), whether your household uses an FSA, where you live (notably, there is more zero clustering for ER spending in the South, and less in the Northeast), and educational attainment.
  3. Both office and inpatient spending are affected by race, marital status, employment status, having more than one job, being self employed, union status, whether health insurance is offered through employment, whether public or private options for health insurance are used, whether FSA's are used, insurance status, census region, and educational attainment.

Finally, a look at whether body mass index affects spending:

In [18]:
for term in bmi:
    x_min=data[term].min()
    x_max=data[term].max()
    x_range=[0,50]

    four_correlations(data[term],
                      x_range=x_range,
                      w=data["FINAL_PERSON_WEIGHT_2013"],
                      #xticks=[[-1,1,2],["inapplicable","yes","no"]],
                      office_range=[0,100000],
                      outpatient_range=[0,40000],
                      inpatient_range=[0,100000],
                      er_range=[0,60000],
                      colorbar=True,
                     )
    print("\n\n\n")



From looking at this, there doesn't seem to be a strong correlation between spending and BMI in any category. I find this somewhat counterintuitive, since obesity is supposed to be correlated with poor cardiovascular health. I'm inclined to leave this feature in to see if there are any interesting interaction effects that may turn up in the modeling process.

At this point we have a better view of the data, and a sense of what features we expect to be predictive. This looks like a good foundation for a pretty rich modeling exercise.

In the next post, I'll write about the model selection process.

GitHub · LinkedIn · ·

© Copyright 2015 - 2016, Soumya D. Sanyal.